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I For high volume data streams and large data warehouses, sampling is used for efficient 

•/^ ■ approximate answers to aggregate queries over selected subsets. Mathematically, we are dealing 

with a set of weighted items and want to support queries to arbitrary subset sums. With unit 
^/^ • weights, we can compute subset sizes which together with the previous sums provide the subset 

I averages. The question addressed here is which sampling scheme we should use to get the most 

• ' accurate subset sum estimates. 

I We present a simple theorem on the variance of subset sum estimation and use it to prove 

variance optimality and near-optimality of subset sum estimation with different known sampling 
schemes. This variance is measured as the average over all subsets of any given size. By optimal 
^ ■ we mean there is no set of input weights for which any sampling scheme can have a better 

^\ I average variance. Such powerful results can never be established experimentally. The results 

' of this paper are derived mathematically. For example, we show that appropriately weighted 

I systematic sampling is simultaneously optimal for all subset sizes. More standard schemes 

such as uniform sampling and probability-proportional-to-size sampling with replacement can 
be arbitrarily bad. 

I Knowing the variance optimality of different sampling schemes can help deciding which 

^ I sampling scheme to apply in a given context. 



1 Introduction 

Sampling is at the heart of many DBMSs, Data Warehouses, and Data Streaming Systems. It 
is used both internally, for query optimization, enabling selectivity estimation, and externally, for 
speeding up query evaluation, and for selecting a representative subset of data for visualization [IB]. 
Extensions to SQL to support sampling are present in DB2 and SQLServer (the TABLESAMPLE 
keyword |13j). Oracle (the SAMPLE keyword [11]), and can be simulated for other systems using 
syntax such as ORDER BY RANDOM() LIMIT 1. Users can also ensure sampling is used for 
query optimization, for example in Oracle (using dynamic-sampling p]). 

Mathematically, we are here dealing with a set of weighted items and want to support queries 
to arbitrary subset sums. With unit weights, we can compute subset sizes which together with the 
previous sums provide the subset averages. The question addressed here is which sampling scheme 
we should use to get the most accurate subset sum estimates. More precisely, we study the variance 
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of sampling based subset sum estimation. We note that there has been sevaral previous works in 
the data base community on samphng based subset sum estimation (see, e.g., [2l [T^ [T5]). 

The formal set-up is as follows. We are dealing with a set of items i G [n] with positive 
weights Wi. Here [n] = {1, n}. A subset S C [n] of these are sampled, and each sampled item 
i is given a weight estimate Wi. Unsampled items i ^ S have a zero weight estimate Wi = 0. We 
generally assume that sampling procedures include such weight estimates. We are mostly interested 
in unbiased estimation procedures such that 

E[wi]=Wi V«e[n]. (1) 

Often one is really interested in estimating the total weight wj of a subset / C [n] of the items, that 
is, wj = 'Ylii^i'^i- As an estimate wi, we then use the sum of the sampled items from the subset, 
that is, wj = X^ig/ ^/ = Z^iG/ns linearity of expectation this is also unbiased, that is, from 

([I]) we get 

E[wi]=wi V/C[n]. (2) 

We are particularly interested in cases where the subset / is unknown at the time the sampling 
decisions are made. For example, in an opinion poll, the subset corresponding to an opinion is 
only revealed by the persons sampled for the poll. In the context of a huge data base, sampling is 
used to reduce the data so that we can later support fast approximate aggregations over arbitrary 
selected subsets [I2l [TSl [T5] . 

Applied to Internet traffic analysis, the items could be records summarizing the flows streaming 
by a router. The weight of a flow would be the number of bytes. The stream is very high volume 
so we can only store samples of it efficiently. A subset of interest could be flow records of a newly 
discovered worm attack whose signature has just been determined. The sample is used to estimate 
the size of the attack even though the worm was unknown at the time the samples were chosen. 
This particular example is discussed in [15j , which also shows how the subset sum sampling can be 
integrated in a data base style infrastructure for a streaming context. In [15] they use the threshold 
sampling from [7] which is one the sampling schemes that we will analyze below. 

Generally there are two things we want to minimize: (a) the number of samples viewed as a 
resource, and (b) the variance as a measure for uncertainty in the estimates. 

For several sampling schemes, we already understand the optimality with respect to the sum of 
the individual variances 

= ^ \ai[w.-^ (3) 

«G[n] 

as well as the variance of the total sum 

yS = Var[^I;[„]] j = Var[ J] i£.,] | (4) 

However, what we are really interested in is the estimation of subsets of arbitrary sizes. 

Before continuing, we note that there is an alternative use of sampling for subset sum estimation 
in data bases; namely where data are organized to generate a sample from any selected subset. 
Generating such samples on-the-fly has been studied with different sampling schemes in [2| [5| [H]. 
When each subset gets its own sample, we are only interested in the variance of totals VS. 

In this paper, we generate the sample first, and then we use this sample to estimate the weight 
of arbitrary subsets. As discussed in [15], this is how we have to do in a high volume streaming 



2 



context where items arrive faster and in larger quantities than can be saved; hence where only a 
sample can be stored efficiently. The sampling first is also relevant if we want to create a reduced 
approximate version of a large data ware house that can be downloaded on smaller device. 



1.1 Performance measure 

The purpose of our sampling is later to be able to estimate arbitrary subset sums. With no advance 
knowledge of the subsets of interest, a natural performance measure is the expected variance for a 
random subset. We consider two distributions on subsets: 

Sm:n denoting the uniform distribution on subsets of size m. 

Sp denoting the distribution on subsets where each item is included independently with probability 
p. 

Often we are interested in smaller subsets with p = o(l) or m = o(n). The corresponding expected 
variances are denoted 

Vm:n = E/^5^^„ [Var 
Wp = E,^5jVarK]] 

Note that Vi-.n = ^V/n and K:n = Wi = VT.. 

We are not aware of any previous analysis of the average variance of subset sum estimation. 



1.2 A basic theorem 

Our basic theorem below states that our subset sum variances are simple combinations of 'EV and 
VE. The quantities T,V and T,V are often quite easy to analyze, and from them we immediately 
derive any Vmm- 

Theorem 1 For any sampling scheme, we have 

Vm:n = " + VE (5) 

n \ n — 1 n — 1 J 

Wp = p{il-p)EV + pVE). (6) 

Theorem [1] holds for arbitrarily correlated random estimators Wiji G [n] with -^[wi] = Wi. That is, 
we have an arbitrary probability space $ over functions w mapping indices i £ [n] into estimates 
Wi. Expectations and variances are all measured with respect to The only condition for our 
theorem to hold true is that the estimate of a subset is obtained by summing the estimates of its 
element, that is, wj = Yli^i'^i- 

One nice consequence of ([5]) is that 

n — m 

Vm-.n > m Vi;n 

n — 1 

This means that no matter how much negative covariance we have, on the average, it reduces the 
variance by at most a factor . 

A nice application of ([6]) is in connection with a random partition into q subsets where each 
item independently is assigned a random subset. A given subset includes each item with probability 
p = so by linearity of expectation, the expected total variance over all sets in the partition is 

q-Wp = {{l-p)EV +pVE) 
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1.3 Known sampling schemes 



We will apply Theorem [T] to study the optimality of some known sampling schemes with respect to 
the average variance of subset sum estimation. Below we first list the schemes and discuss, what is 
known about T,V and VT,. Our findings with Theorem [1] will be summarized in the next subsection. 

Most of the known sampling schemes use Horvitz-Thompson estimators: if item i was sampled 
with probability pi, it is assigned an estimate of Wi = Wi/pi. Horvitz-Thompson estimators are 
trivially unbiased. 

For now we assume that the weight Wi is known before the sampling decission is made. This is 
typically not the case in survey sampling. We shall return to this point in Section 12.21 

Uniform sampling without replacement (U+R) In uniform sampling without replacement, 
we pick a sample of k items uniformly at random. If item i is sampled it gets weight estimate 
Wi = Wiu/k. We denote this scheme U-Rfc. 

Probability proportional to size sampling with replacement (P+R) In probability pro- 
portional to size sampling with replacement, each sample Sj G [n], j G [k], is independent, and 
equal to i with probability Wi/w^n]- We say that i is sampled if z = Sj for some j G [k]. This hap- 
pens with probability pi = 1 — (1 — Wi/w^n])^- If ^ is now sampled, we use the Horvitz-Thompson 
estimator Wi = l/pi- We denote this scheme P-|-Rfc. 

Threshold sampling (THR) The threshold sampling is a kind of Poisson sampling. In Pois- 
son sampling, each item i is picked independently for S with some probability pi. For unbiased 
estimation, we use the Horvitz-Thompson estimate Wi = Wi/pi when i is picked. 

In threshold sampling we pick a fixed threshold r. For the sample S, we include all items with 
weight bigger than r. Moreover, we include all smaller items with probability Wi/r. Sampled items 
i £ S have the Horvitz-Thompson estimate Wi = Wi/pi = ij;,/ min{l, i(;j/r} = max{wj,r}. With 
k = min{l, t(;j/r} the expected number of samples, we denote this scheme THR^. Threshold 
sampling is known to minimize TiV relative to the expected number of samples. 

In survey sampling, one often makes the simplifying assumption that if we want k samples, no 
single weight has more than a fraction of the total weight [20, p. 89]. In that case threshold 
sampling is simply Poisson sampling with probability proportional to size as described in [201 p. 85- 
87]. More precisely, the threshold becomes r = w^n]/k, and each item is sampled with probability 
Wi/t. We are, however, interested in the common case of heavy tailed distributions where a one 
or a few weights dominate the total [U [19] . The name "threshold sampling" for the general case 
parameterized by a threshold r is taken from [TJ. 

Systematic threshold sampling (SYS) We consider the general version of systematic sampling 
where each item i has an individual sampling probability pi, and if picked, a weight estimate Wi/pi. 
Contrasting Poisson sampling, the sampling decisions are not independent. Instead we pick a single 
uniformly random number x G [0, 1], and include i in S" if and only if for some integer j, we have 




h<i 
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It is not hard to see that Pr[i G 5] = pi. Let k = J2ie[n]Pi expected number of samples. 

Then the actual number of samples is either [A;J or [A;] . In particular, this number is fixed if k is 
an integer. Below we assume that k is integer. 

In systematic threshold sampling we perform systematic sampling with exactly the same sam- 
pling probabilities as in threshold sampling, and denote this scheme SYS^. Hence for each item i, 
we have identical marginal distributions Wi with THR^ and SYSfc. 

Priority sampling (PRI) In priority sampling from [6] we sample a specified number oi k < n 
samples. For each item, a we generate a uniformly random number rj G (0, 1), and assign it a 
priority qi = Wi/vi. We assume these priorities are all distinct. The k highest priority items are 
sampled. We call the {k + l)th highest priority the threshold r. Then i is sampled if and only if 
Qi > T, and then the weight estimate is Wi = max{r, tt;,}. This scheme is denoted PRIfc. 

Note that the weight estimate Wi = max{r, Wj} depends on the random variable r which is 
defined in terms of all the priorities. This is not a Horvitz-Thompson estimator. In [6] it is proved 
that this estimator is unbiased, and that there is no covariance between individual estimates for 
k>l. 



1.4 Variance optimality of known sampling schemes 

Below we compare Vmm and Wp for the different sampling schemes. Using Theorem [1] most results 
are derived quite easily from existing knowledge on 'EV and VS. The derivation including the 
relevant existing knowledge will be presented in Sections HHSl 

When comparing different sampling schemes, we use a superscript to specify which sampling 
scheme is used. For example V^.^ < means that the sampling scheme $ obtains a smaller 
value of Vm-n than does 

For a given set of input weights wi, ...Wn, we think abstractly of a sampling scheme as a proba- 
bility distribution $ over functions w mapping items i into estimates Wi. We require unbiasedness 
in the sense that Et£<_<j,[tt;j] = Wi. For a given Wi £ <I>, the number of samples is the number of 
non-zeroes. For any measure over sampling schemes, we use a superscript OPT^ to indicate the 
optimal value over all sampling schemes using an expected number of at most k samples. For 
example, V^^^'' is the minimal value of V^.^ for sampling schemes $ using an expected number 
of at most k samples. 



Optimality of SYS, THR, and PRI For any subset size m and sample size k, we get 

T/OPTfc ^ T^SYSfc ^ n-m THRfc /^N 
'^m:n ^ m:n ^ ^ m.:n \ ' J 

The input weights ifi, ,Wn where arbitrary, so we conclude that systematic threshold sampling 
optimizes Vm-.n for any possible input, subset size m, and sample size, against any possible sampling 
scheme. For contrast, threshold sampling is always off by exactly a factor ""^ 
Similarly, for any subset inclusion probability p, we get that 



n—m 



From [22], we get that 



W^pOPT, ^ ^^SYS. = (1 _ p) w™^ (8) 

^PRIfc+1 yTHRfe < yPRIfc /qN 

< WfHR, <^PRI, 



5 



Hence, modulo an extra sample, priority sampling is as good as threshold sampling, and hence at 
most a factor ^/{^ ~ P) worse than the optimal systematic threshold sampling. 

Anti-optimality of U— R and P+R We argue that standard sampling schemes such as uni- 
form sampling and probability proportional to size sampling with replacements may be arbitrarily 
bad compared with the above sampling schemes. The main problem is in connection with heavy 
tailed weight distributions where we likely have one or a few dominant weights containing most of 
the total weight. With uniform sampling, we are likely to miss the dominant weights, and with 
probability proportional to size sampling with replacement, our sample gets dominated by copies of 
the dominant weights. Dominant weights are expected in the common case of heavy tailed weight 
distributions [U [lOj . 

We will analyze a concrete example showing that these classic schemes can be arbitrarily bad 
compared with the above near-optimal schemes. The input has a large weight Wn = i and n — 1 
unit weights Wi = 1, i £ [n — 1]. We are aiming at k samples. We assume that i ^ n ^ k ^ 1 and 
£ > k'^. Here x ^ y <^=^ x = uj{y). For this concrete example, we will show that 

V^^J' « {n- m)m/k 
V^-""' > fm/k 



Here x ~ y <^=^ x = (libo(l))y and x > y <^=^> x > (1 — o(l))y. A corresponding set of relations 

can be found in terms of p, replacing n — m with n{l—p) and m withpn. We conclude that uniform 
sampling with replacement is a factor i'^/n from optimality while probability proportional to size 
sampling with replacement is a factor £/n from optimality. Since ^ ^ n it follows that both schemes 
can be arbitrarily far from optimal. 

1.5 Discussion 

One of our conclusions above is that systematic threshold sampling is optimal for the average subset 
variances no matter the subset size m or inclusion probability p. However, there may be scenarios 
where some sampling schemes are not appropriate. In the Section [2] we will study a streaming 
scenario ruling out both threshold and systematic threshold sampling, leaving us with priority 
sampling among the near-optimal schemes. From ([8]) and (jlOp . we get that 



' m:n 

n — m 



'-y^^^' (11) 



Even if we don't know what the optimal appropriate scheme is, this inequality provides a limit to 
the improvement with any possible scheme. In particular, if k is not too small, and m is not too 
close to n, there is only limited scope for improvement. 

1.6 Contents 

The rest of the paper is divided as follows: In Section[2]we discuss our results in concrete application 
scenarios including survey sampling and related experimental work. In Section [3] we prove Theorem 
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[TJ In Sections SHS] we will derive the optimability results. In Section [6] we discuss extensions to 
biased sampling, and finally we have some concluding remarks in Section [71 

2 Application scenario and related experimental work 

In this section, we shall discuss an important Internet related application where systematic and 
threshold sampling are less appropriate, hence where the better choice is to settle for near-optimality 
of priority sampling. 

The setup of the scenario in this section is taken from ^ [TS] . It serves to contextualize the 
preceding optimality results in a realistic context. We will also mention related experimental work 
from p] complementing the analytic results of this paper. 

2.1 Reservoir sampling 

We are here focusing on reservoir sampling (c.f. ^lOj and [16^ p. 138-140]) for a stream of weighted 
items. In reservoir sampling, the items arrive one by one, and a reservoir maintains a sample S of 
the items seen thus far. When a new items arrives, it may be included in the sample S and old 
samples may be dropped from S. Old items outside S are not reconsidered. Reservoir sampling 
addresses two issues: 

• The streaming issue [17j where we want to compute a sample from a huge stream that passes 
by only once, when the memory available to us is limited. 

• The incremental data structure issue of maintaining a sample as new items are added. In 
our case, we use the sample to provide quick estimates of sums over arbitrary subsets of the 
items seen thus far. 

The reader is referred to pL5j for a description of how reservoir sampling and subset sum estimation 
can be integrated in a data base style infrastructure for a streaming context. 

2.2 Relation to survey sampling 

The above set-up is similar to that of classic survey sampling (see, e.g. [20J). However, in survey 
sampling, typically, we do not know the weight Wi of an item i unless we sample it. Instead we 
have free access to an auxiliary variable Ui that is correlated with Wi, and use Ui to determine the 
sampling probability pi for item i. For example, if the item i is a house and Wi is house hold income, 
then Ui could be an approximation of Wi based on the address. We can then use Ui to determine 
the sampling probability pi for item i. The weights Wi will only be found for the items sampled. 

The previously discussed techniques provide an estimate Ui of the known variable Ui, and then 
we use Wi = WiUi/ui as an estimator for Wi. If Ui is an unbiased estimator, then so is Wi, that is, 
'E[wi] = WiFj[ui]/ui = Wi. Also, if Ui is a Horvitz-Thompson estimator, then so is Wi, that is, if i is 
sampled, then Wi = WiUi/ui = Wi{ui/pi)ui = Wi/pi. 

In survey sampling, the main challenge is often to estimate the total W[n] based on the sampled 
weights. They often have an analysis of VTi assuming that Wi = Ui, and then they use this to 
indicate that a scheme will be good if Wi ~ Uj. For example, it is known that ys^"^^*^ = |20i pp. 
88,96,97], and that threshold sampling minimizes VT, among all Poisson sampling schemes [20\ p. 
86]. 
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Knowing the exact weight comes in naturally in computer science when the purpose of the 
sampling is to reduce a large data set so that we can later support fast approximate aggregations 
over arbitrary subsets. For example, this idea is used in data bases [U [121 IIBl IE]- Nevertheless, 
there could be cases where sampling is made with one weight Ui in mind, but later used for another 
weight Wi. This case is treated as in survey sampling. In case of heavy tailed distributions, uniform 
sampling is basically useless. Hence it is very important that Ui is large when Wi is large. 

Our context is that of a large stream of weighted items passing by. When item i passes by, we 
get to see its weight Wi. If our goal was to compute tfj^j, we would simply accumulate the weights 
in a counter. Hence, in our context, the challenge of survey sampling is trivial. 

One thing that makes reservoir sampling hard is that sampling decisions are made on-line. This 
rules out off-line sampling schemes such as Sunter's method |21].|20t p. 93-97] where we have to 
sort all the items before any sampling decisions are made. 

A cultural difference between survey sampling and our case is that survey sampling appears 
less focused on heavy tailed distribution. For threshold or systematic threshold sampling one can 
then assume that the threshold is bigger than the maximal weight, hence that these schemes use 
probabilities proportional to size. In our kind of applications, heavy tailed distributions are very 
prominent [H [19] . 

2.3 Internet traffic analysis 

With a concrete Internet example, we will now illustrate the selection of subsets and the use of 
reservoir sampling for estimating the sum over these subsets. For the selection, the basic point is 
that an item, besides the weight, has other associated information, and selection of an item may be 
based on all its associated information. As stated in ([2]), to estimate the total weight of all selected 
items, we sum the weight estimates of all selected sampled items. 

Internet routers export information about transmissions of data passing through. These trans- 
missions are called flows. A flow could be an ftp transfer of a file, an email, or some other collection 
of related data moving together. A flow record is exported with statistics such as application type, 
source and destination IP addresses, and the number of packets and total bytes in the flow. We 
think of the byte size as the weight. 

We want to sample flow records in such a way that we can answer questions like how many bytes 
of traffic came from a given customer or how much traffic was generated by a certain application. 
Both of these questions ask what is the total weight of a certain selection of flows. If we knew in 
advance of measurement which selections were of interest, we could have a counter for each selection 
and increment these as flows passed by. The challenge here is that we must not be constrained to 
selections known in advance of the measurements. This would preclude exploratory studies, and 
would not allow a change in routine questions to be applied retroactively to the measurements. 

A killer example where the selection is not known in advance was the tracing of the Internet 
Slammer worm [24j. It turned out to have a simple signature in the flow record; namely as being 
udp traffic to port 1434 with a packet size of 404 bytes. Once this signature was identified, the 
worm could be studied by selecting records of flows matching this signature from the sampled flow 
records. 

We note that data streaming algorithms have been developed that generalize counters to provide 
answers to a range of selections such as, for example, range queries in a few dimensions |17j . 
However, each such method is still restricted to a limited type of selection to be decided in advance 
of the measurements. 
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Figure 1: Estimation of 8 x 8 traffic matrix. 



2.4 Experiments on real Internet data 

In [9], the above Internet application is explored with experiments on a stream segment of 85,680 
flow records exported from a real Internet gateway router. These items were heavy tailed with a 
single record representing 80% of the total weight. Subsets considered were entries of an 8 x 8 traffic 
matrix, as well as a partition of traffic into traffic classes such as ftp and dns traffic. Figure [T] shows 
the results for the 8x8 traffic matrix with all the above mentioned sampling schemes (systematic 
threshold sampling was not included in [9], but is added here for completeness). The figure shows 
the relative error measured as the sum of errors over all 64 entries divided by the total traffic. The 
error is a function of the number k of samples, except with THR, where k represents the expected 
number of samples. 

We note that U— R is worst. It has an error close to 100% because it failed to sample the 
large dominant item. The P+R is much better than U+R, yet much worse than the near-optimal 
schemes PRI, THR, and SYS. To qualify the difference, note that P+R use about 50 times more 
samples to get safely below a 1% relative error. 

Among the near-optimal schemes, there is no clear winner. From our theory, we would not 
expect much of a difference. We would expect THR to be ahead of PRI by at most one sample. 
Also, we are studying a partitioning into 64 sets, and then, as noted in Section II. 2^ the average 
variance advantage of SYS is a factor 1 — 1/64, which is hardly measurable. 

The experiments in Figure [1] are thus consistent with our theory. The strength of our mathe- 
matical results is that we now know that no one can ever turn up with a different input or a new 
sampling scheme and perform better on the average variance. Conversely, experiments with real 
data could illustrate subsets with relevant special properties that are far from the average behavior. 

2.5 Resource constrained reservoir sampling 

Our analysis names systematic threshold sampling the best possible sampling scheme. However, 
in reservoir sampling we often have a resource bound on the number k of samples we can store, 
e.g., we may only have a certain amount of memory available for the samples. Priority sampling 
is ideally suited for this context in that a standard priority queue can maintain the k + 1 items 
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of highest priority (when a new item arrives, it is first assigned a priority, then it is added to the 
queue, and finahy we remove the item of smallest priority item from the queue in 0(log k) time. 

However, with both threshold sampling and priority sampling it appears that 
we need to know the threshold r in advance (item i is sampled with prob- 
ability min{l, lUj/r}). This threshold r is a function of all items such that 
^ min{l, ifj/r} = k. Hence r can only be determined after the whole stream has been 
investigated. 

As described in [9] it is possible, though a bit more complicated, to adapt threshold sampling 
for a stream to provide an expected number of k samples. The essential idea is that we increase 
the threshold as we move along the stream in such away that it always gives an expected number 
of k samples from the items seen thus far. Thus an item i gets dropped from the sample when 
the threshold falls below its priority. However, if we want to be sure to no more than k samples 
are made, we have to shoot for substantially less than k samples. For example, to stay below k 
with 99% probability, using Normal approximation for larger k, we should only go for an expected 
number of — 2.3^/k samples. In contrast, with priority sampling, we do better than threshold 
sampling with an expected number of /c — 1 samples. Thus priority sampling works better when we 
are allowed at most k samples. 

For systematic threshold sampling, the problem is more severe because if one changes the 
threshold marginally, it may completely change the set of samples. One could conceivably resolve 
this if we only increased the threshold by an exact doubling starting. However, a doubling of the 
threshold can be shown to at least double the variance. Another objection to systematic threshold 
sampling in a streaming context is that we may have a very strong correlations between items in a 
subset depending on how they are placed in the stream. Normally, it is recommended that the items 
are appropriately shuffled [20^ p. 92] , but that is not possible in reservoir sampling from a stream. 
With threshold and priority sampling there is no such dependence as there is no covariance between 
different item estimates. As demonstrated in [23], it is possible to get good confidence bounds with 
priority sampling and threshold sampling so that we statistically know when we get good estimates 
for a subset. The correlation between items in systematic threshold sampling prevents us from 
providing good confidence intervals, so even if systematic threshold sampling gives better variance 
on the average, we have no way of knowing if we get these good estimates for a concrete subset. 

Thus, among our near optimal sampling schemes, priority sampling is the most appropriate for 
resource constrained reservoir sampling. 

2.6 On the suboptimality of priority sampling 

Recall from ([IT]) that 

T/PMfc+i < ^~ 1 T/OPTfc 
n — m 

In our Internet application we typically have thousands of samples. Hence we are not concerned 
about the difference between k and k + \ samples. 

The factor is only significant for larger sets m. However, for larger sets, we expect to 

do great anyway because they relatively speaking have much smaller errors. More precisely, we 
typically expect that we have plenty of samples go get very a good estimate of the total, or in other 
words, that the relative standard deviation Sn-.n = ^/V^/w^n] foi^ the total is very small. 

Since priority sampling has no covariance, Vm-n = rn/n ■ VS. At the same time, the average 
subset sum is m/n ■ w^n]- For a subset achieving both of these averages, the relative standard 
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deviation would be 

■sjmjn ■ 
m/n ■ Win] 



mEr 



However, if ^pnjrri is big, then the optimahty factor ^-m close to 1. 

Thus, it is when our variance is expected comparatively small that our relative distance to OPT 
is greatest, the most extreme being in the estimation of the total. The estimate of the total has the 
smallest relative standard deviation, but since it is positive, it is infinitely worse than V^-K^^ = 0- 

Another case where we do not need to worry so much about the non-optimality factor is 
if we are interested in the relative weight of a subset / of size m. As an estimator, we use wi • 
If m > 're/2, we note that wilw^n] = 1 ~ ''^[n]\//^[n]- Most of the error in this estimate stems 
from the estimate of the small set [n] \ /, but for this small set size, we are at most a factor 

n-(n72+i) < 2 f'^o™ optimality. 

As discussed in Section [2.51 we do not know if there is a scheme performing better than priority 
sampling in practice in the context of resource constrained reservoir sampling. The conclusion of 
this section is that even if there is a better scheme, it is not going to help us much. 

3 Proof of basic theorem 

In this section we prove dH]) 

Vm:n = — + VY. 

n \ n — \ n — 1 J 

and ® 

Wp = p {{I - p)YV + pVE) . 
By the definitions of variance and covariance, for any subset / C [n], 

Yai[wi] = Aj + Bj 

where 

Ai = ^Var[u;i] 

iinI 

Bj = Yl CoY[wi,Wj] 

Suppose I is chosen uniformly at random among subsets of [n] with m element. Then for any i, 
Pr[z S /]= m/n, so by linearity of expectation. 



E[Ai] = ^ Pr[i G /]Var[t 



= m/n-A[n]- 

Also, for any j ^ i, Pr[i,j £ I] = m/n ■ [m — l)/(n — 1), so by linearity of expectation, 

= m/n ■ {m — l)/(n — 1) • 
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Thus 

E[Var[?i/]] = m/n ■ ^[„] + m/n ■ (m - l)/(n - 1) • (12) 
By definition, Ar„i = SI/. Moreover, by (fT2]) . 



so 



VT. = A^n] + Bin] 



Bin] = vj:- j:v. 



Consequently, 



Vm:n = E[Yar [wj]] 



n n n — 1 

m f n — m m — 1 

+ 



n \ n — 1 n — 1 

This completes the proof of ([5]). 

The proof of ([6]) is very similar. In this case, each i E [n] is picked independently for /' with 
probability p. By linearity of expectation, 

E[Ai]=pA[n]. 

Also, for any j ^ i, Pr[i, j S /] = p^, so by linearity of expectation, 

E[Bi]=p^Bin]. 

Thus 

V; = E[YaT[wi]] 

= p{{l - p)^V + pV^) 



This completes the proof of ([6]) , hence of Theorem [H 

4 Near-optimal schemes 

We will now use Theorem[T]to study the average variance (near) optimality of subset sum estimation 
with threshold sampling, systematic threshold sampling, and priority sampling for any possible set 
of input weights. The results are all derived based on existing knowledge on and VT,. Below 
we will focus on Vmm based on random subsets of a given size m. The calculations are very similar 
for Wp based on the inclusion probability p. 

It is well-known from survey sampling that [20' pp. 88,96,97] that systematic sampling always 
provides an exact estimate of the total so FS^^^'' = 0. Since variances cannot be negative, we have 
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It is also known from survey sampling \20\ p. 86] that threshold sampling minimize VE among all 
Poisson sampling schemes. In [9] it is further argued that threshold sampling minimizes T,V over 
all possible sampling schemes, that is, Sl/™^*^ = Sy^^"""*. Since systematic threshold sampling 
uses the same marginal distribution for the items, we have 

^yTHRfc _ ^ySYSfe _ j-jyOPTfc 

Since SYS^ optimizes both T,V and VT, we conclude ([5]) that it optimizes Vm-.n for any subset size 
m. More precisely, using ([5|), we get 



n \ n — 1 ^ n— 1" / 



m f n — m ^opt m — 1 



nVn— 1 n— 1 



< T/OPTfc < T^SYSfc 

Hence 

T/SYSfc ^ T^OPTfe 



As mentioned above we have TiV'^^^'' = Moreover, threshold sampling has no covariance 

between individual estimates, so 



n n 



But in the previous calculation, we saw that 



Hence we conclude that 



T/SYS;, _ m n-m ^^^qpt^ 
n n — 1 



T^SYSfe ^ ^ ~ T/THRfe 
n — 1 



This completes the proof of ([5]). A very similar calculation establishes ([6]). 
In ^22j it is proved that 

^^PRIfe+i ^ j-j^THRfc ^ ^^PRIfc 

Moreover, for any scheme $ without covariance, we have 

"mm ^ ^'^ 

Since both threshold and priority sampling have no covariance, we conclude ([9]) 

^PRIfe+i T^THRfe < T^PRIfe 
Vm:n ^ ►'mm — •'mm 

The proof of (fTOl) is similar based on PV^ = pHV^. 
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5 Anti-optimal schemes 



Below we will analyze a concrete example showing that the classic schemes of uniform sampling 
without replacement and probability proportional to size sampling with replacement can be arbi- 
trarily bad compared with the above near-optimal schemes. 

The concrete example consists of n — 1 unit weights Wi = 1, i £ [n — 1] and a large weight 
Wji = £■ We are aiming at k samples. We assume that £ ^ n ^ k ^ 1 and that £ 3> fe^. 

As in the last section, we focus on the subset size m rather than the inclusion probability p. 



5.1 Threshold samphng 

We will now analyze the variance with threshold sampling for the bad example. The variance with 
systematic and priority sampling will then follow from ([7]) and ([9]). 

Threshold sampling (THR^) will use the threshold r = ^^Ej- This will pick the large weight 
Wn = i with probability p„ = 1 and weight estimate Wn = Wn- Hence Var[t()„] = 0. Each unit 
weight Wj, i < n, is then picked with probability pi = ;^5x ^-iid estimate = f^x- The variance 
of the estimate for a unit weight item is then pi{\ —pi)/Pi = (1 —pi)/pi = fEf; so the total 

variance is T,V = ]}}^ ~ Since there is no co- variance, we conclude for any subset size 
m < n that 

From d?]) we get that 



y™Rfe ^ . j^yTUR, _ ^^^^ 



^m-.n ~ Y ''^''^/^ [n — m)m/k 



Finally, since k = uj{l) it follows from ([9]) that 



5.2 Uniform samphng without replacement 

In uniform sampling without replacement (U-Rfc), we pick a sample of k different items uniformly 
at random. As we shall see below, the variance of uniform sampling is dominated by the variance 
of estimating the large weight. 

The large weight Wn = ^ is picked with probability Pn = k/n and estimate i/p. Hence 

nwl]=Pn{£/Pnf =ne/k. 

It follows that 

Var[u;„] = E[w^] -wl = nl^/k - ^ nl^/k 

Hence 

T.V > Ya.i[wn] ^ nf/k. 
To study the variance VT, of the total sum estimate w^n] , we note that 

E[^^;f„]] > E[wl] = nf/k. 

Hence 

= E[wfj - wf, > nf/k - + n - 1)2 « nl^/k 
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Since T,V and VT, are both lower bounded by (1 — o{l))n£'^ /k, it follows from ^ that for any 
subset size m < n 

^m"^' > {m/n)nf/k = mf/k 
This is roughly a factor i'^ worse than what we had with any of the near optimal schemes. 

5.3 Probability proportional to size sampling with replacement 

In probability proportional to size sampling with replacement (P+R^), each sample Sj G [n], j G [k], 
is independent, and equal to i with probability Wi/w^n]- An item i is sampled if i = Sj for some 
j G [k]. This happens with probability = 1 — (1 — Wi/w^n])'^^ ^^'^ if ^ is sampled, wi = l/pi- 

In our bad example, the variance with P+Rfc relates to the fact that we get mostly duplicates 
of the large item. The expected number of unit samples is only n — l/{n + 1), and as a result, we 
get a large variance from the unit items. 

Each unit item is picked with probability 

= 1 - (1 - l/(^ + n - l))^' w k/L 

Hence 

EF> (n- l)Var[u'i] = (n - l)pi(l - ^n£/k. (13) 

This is a factor i/n worse than with threshold sampling. 
We will now show that VT, > TV, or equivalently, that 

TV-VT, = o{n£/k). 

By definition 

VT, = TV + 2^ {E[wiW[i_i]] - WiWii^i]) , 

i>l 

SO 

TV -VT = 2 ^ {wiW[i_i-i - E[wi?i[j_i]]) 

i>l 

= 2^{wiiw[i_i]-E[w[i_,]\i£S]) (14) 
To bound this sum, first we consider the term with i = n. 

Wlrr-l] = E[u)[„_i]] = PnE[w[n^i]\n E S] + (1 - p„)E[ui[„_i] |n ^ S] 

so 

W[n-i] - E[w[n-i]\n G 5] > (1 -p„)E[ui[„_i]|n S] 

Here (1 -p„) = Pr[z ^ S] = ((n - l)/(^ + n - 1))^ < (n/i)''. Moreover E[w[^_i]\n ^ S] < k/pi ^ i, 
so (1 — pn)E[w[n-i]\n ^ S] = o{n/k). Hence 

Wn {w[n-i] - E[?i)[„_i]|i G S]) = o{in/k). (15) 
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as desired. Next we consider i < n. We have 



and 



Wi{w[i^i] - E[w[j_i]|i G S]) 

= {i-l)-{i-l)PT[le S\ie S]/pi] 



Fi[l £ S\i £ S]/pi > {i - I) Fiil G S\Sk = i]/pi 
> Pi[l £ S\Sk = i]/pi 
1 - (1 - l/(^ + n- 



> 



> 



> 



1 - (1 - l/(£ + n- 1))*^- 
l-{l-l/£)^-^ 



1 - (1 - 1/e)'' 

k~ 
2£ 



e k 

> 1-1 k — 

= l-0{l/k) 



The last derivation follows because £ > k^. Hence 

{i-l)-ii- 1) Pr[l G S\i £ S]/pi) = 0{i/k) 

so 

n-1 



n-1 

= 0{i/k) = 0{n^/k) = o{ni/k) (16) 

1=2 

Combining ([HD, ([IS]), and ([IS]), we conclude that -VT. = o{ni/k), hence that 

> > n£/k 

Together with (jl3p and Theorem [H it follows for any set size m, that 

> {m/n)n£/k = m£/k 

This is a factor l/n more than T/^.^^''. 

6 Biased estimators 

So far we have restricted our attention to unbiased estimators. With biased estimators we would 
consider mean square error (MSE) instead of just variance. We note that even though a biased 
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estimator may give a smaller MSE than an unbiased one, there are many standard reasons to 
prefer unbiased estimators. For example, if we want to combine estimates in a sum, we can use 
linearity of expectation to conclude that the sum of the estimators is unbiased if each estimator 
is unbiased. Also, if we add independent unbiased estimators, the variances are just added. With 
biases, we cannot just add up the mean square errors. An example where we wish to combine 
independent estimators is if we have independent samples from different streams. In the Internet 
application, these streams could be flow records from different routers where would want to combine 
the information in a global picture [8]. In other words, a biased estimator may be OK if all we 
consider is a single isolated estimate. However, as soon as we start combining estimates, the bias 
may come back and haunt us. 

Despite these caveats of biased estimators, we discuss them briefly below to see how they fit into 
subset sum estimation. As a concrete example of biased estimation, suppose a sampling scheme 
does not provide exact estimation of the total, but that the total is known. Then, for each item i, 
we could use the adjusted estimator x'^'^-^ = Xj(x[„]/x[„]). Then the total is right in the sense that 
xj^^j' = In the case of threshold sampling with no dominant weights, this adjusted estimator 

is equivalent to the estimator suggested in |20i p. 87]. The adjusted threshold sampling estimator 
will bias towards large weights. However, the corresponding adjusted uniform sampling estimator 
will have bias towards smaller weights. 

Now, if we allow bias, how well can we do with respect to our average mean square error? It 
can easily be seen that our main theorem holds for mean square error and not just for variances. 
That is, with MSE denoting mean square error instead of V for variance, we get the following 
generalization of ([5]): 

MSE^.n = -( ^ J^MSE + MSEJ:] (17) 

n \ n — 1 n — 1 ) 

In fact, our formulas generalize to any symmetric quadratic polynomial. As with the variance of 
unbiased estimators, we can use (I17p to compute MSEravn for a concrete sampling scheme for which 
we know Y^MSE and MSET.. 

Now, if we want to minimize averages and there is no requirement of unbiasedness, the optimal 
performance is obtained by a concrete sample, thus with no randomness in the sample. Assume 
that the weights are in decreasing order so that wi is the largest weight. If all we cared about was 
MSETi, we could give some item the total weight, and drop the rest. If all we cared about was 
T,MSE, the optimal choice is to pick the k largest weights, using their real weights as the estimate. 
Then J:MSE = Z^>ky^l 

To optimize MSEm-n, we introduce a parameter X for the negative error w^n] ~ ^[n] i^i the 
total. Then MSET, = X"^ . To minimize TjMSE, the optimal choice is to pick the k largest 
weights, setting the rest to 0. For the k largest weights, we distribute the error equally, setting 
w, = w, + iZ,^,Wi-X)/k. Theui:MSE = k{{Y:^^^w,-X)/kf + Y:,>^w} = {Y.^^^w,-Xf/k+ 
^i>k '^'i- "^^^ l^-st term is fixed, so to optimize MSEm-n-, we should choose X so as to minimize 

2^ Wi - Xfk + X^. 

n — 1 ^-^ n — 1 

For m = 1, we choose X = Yli>k '^i-> then Wi = Wi iov i < k as discussed above. 

Obviously, picking the k largest weights and giving them a specific estimate is not a good 
"sampling" scheme. The above more illustrates the danger of just looking at averages and the 
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deceptiveness of biased estimation. For non-random subsets such as a large set of small items, the 
above scheme would always return a zero. This kind of unfairness isn't right. Recall that we had 
a similar criticism of systematic sampling in Section 12.51 if we could not shuffle the items. 

An ideal sampling scheme should both have a reasonable fairness and perform reasonably well 
on the average. Threshold and priority sampling have no covariance, so all partitions have the 
same total variance. Here by considering partitions rather than individual subsets, we ensure that 
each item is counted exactly once. Moreover, among unbiased schemes, they essentially got within 
a factor ^Erf from optimality on the average variance for subsets of size m, so when m is not too 
close to n, this is close to ideal. 



7 Concluding remarks 

As a formal measure for ability to estimate subset sums of a set of n weighted items, we suggested 
for each set size m, to study the average variance over all subsets: 

Vm:n = E/c[n],|/|=m [Var[u)/]/m] 

We discovered that Vm-n was the following simple combination of the sum of variances TiV and the 
variance of the total sum VE,: 

Vra.n = " + VY. 

n \ n — \ n — 1 

A corresponding formula was found for the expected variance Wp for subsets including each item 
independently with probability p. 

We then considered different concrete sampling schemes. The optimality of and VT, was 
already known for some sampling schemes, and this now allow us to derive the optimality with 
respect to Vm-n for arbitrary subset size m. 

We found that systematic threshold sampling was optimal with respect to Vm-.n, and that thresh- 
old sampling was off exactly by a factor ^E^- Finally, we know that priority sampling performs 
like threshold sampling modulo one extra sample. We argued that this distance to optimality is 
not significant in practice when we use many samples. This was important to know in the context 
of resource constrained reservoir sampling, where priority sampling is the better choice for other 
reasons. 

For contrast, we also showed that more classic schemes like uniform sampling with replacement 
and probability proportional to size sampling without replacement could be arbitrarily far from 
optimality. The concrete example was stylistic heavy tailed distribution. 



References 

[1] R.J Adler, R.E. Feldman, and M.S. Taqqu. A Practical Guide to Heavy Tails. Birkhauser, 
1998. 

[2] N. Alon, N. Duffield, C. Lund, and M. Thorup. Estimating arbitrary subset sums with few 
probes. In Proc. 24th PODS, pages 317-325, 2005. 

[3] D. K. Burleson. Inside oraclelOg dynamic sampling. 

|http: //www, dba-oracle . com/art _dbazine_oracle lOg_dynaniic _sain^ lingJiint . htm. 



18 



[4] S. Chaudhuri, R. Motwani, and V.R. Narasayya. On random sampling over joins. In Proc. 
ACM SIGMOD Conference, pages 263-274, 1999. 

[5] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. 
J. Comput. Syst. Sci., 55(3):441-453, 1997. 

[6] N.G. Duffield, C. Lund, and M. Thorup. Flow sampling under hard resource constraints. In 
Proc. ACM IFIP Conference on Measurement and Modeling of Computer Systems (SIGMET- 
RlCS/Performance), pages 85-96, 2004. 

[7] N.G. Duffield, C. Lund, and M. Thorup. Learn more, sample less: control of volume and 
variance in network measurements. IEEE Transactions on Information Theory, 51(5):1756- 
1775, 2005. 

[8] N.G. Duffield, C. Lund, and M. Thorup. Optimal combination of sampled network measure- 
ments. In Proc. 5th ACM SIGCOMM Internet Measurement Workshop (IMC), pages 91-104, 
2005. 

[9] N.G. Duffield, C. Lund, and M. Thorup. Sampling to estimate arbitrary subset sums. Technical 
Report |cs.I)S/0509026| Computing Research Repository (CoRR), 2005. Preliminary journal 
version of [6]. 

[10] C.T. Fan, M.E. Muller, and I. Rezucha. Development of sampling plans by using sequential 
(item by item) selection techniques and digital computers. J. Amer. Stat. Assoc., 57:387-402, 
1962. 



[11] Oracle User's Co-Operative FAQ. http://www.jlcomp.demon.co.uk/faq/randoiii.htinl 



[12] M.N. Garofalakis and P.B. Gibbons. Approximate query processing: Taming the terabytes. 
In Proc. 27th VLDB, page Tutorial 4, 2001. 

[13] P. J. Haas. Speeding up db2 udb using sampling. 



|http : //www . almaden ■ ibm. com/cs/people/peterh/ idugjbig.pdf 



[14] J.M. Hellerstein, P.J. Haas, and H.J. Wang. Onhne aggregation. In Proc. ACM SIGMOD, 
pages 171-182, 1997. 

[15] T. Johnson, S. Muthukrishnan, and I. Rozenbaum. Sampling algorithms in a stream operator. 
In Proc. ACM SIGMOD, pages 1-12, 2005. 

[16] D.E. Knuth. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms. Addison- 
Wesley, 1969. 

[17] S. Muthukrishnan. Data streams: Algorithms and applications. Foundations and Trends in 
Theoretical Computer Science, 1(2), 2005. 

[18] F. Olken and D. Rotem. Random sampling from databases: a survey. Statistics and Computing, 
5(l):25-42, 1995. 

[19] K. Park, G. Kim, and M. Crovella. On the relationship between file sizes, transport protocols, 
and self-similar network traffic. In Proc. 4th IEEE Int. Conf. Network Protocols (ICNP), 1996. 



19 



[20] C-E. Sarndal, B. Swensson, and J. Wretman. Model Assisted Survey Sampling. Springer, 1992. 

[21] A. B. Sunter. List sequential sampling with equal or unequal probabilites without replacement. 
Applied Statistics, 26:261-268, 1977. 

[22] M. Szegedy. The DLT priority sampling is essentially optimal. In Proc. 38th ACM Symp. 
Theory of Computing (STOC), pages 150-158, 2006. 

[23] M. Thorup. Confidence intervals for priority sampling. In Proc. ACM IFIP Conference on 
Measurement and Modeling of Computer Systems (SIGMETRICS/Performance), pages 252- 
253, 2006. 

[24] Slammer worm, http : / / securityresponse . Symantec . com/ avcenter/venc/ data/w32 . sql 3xp . worm . html. 



20 



